Add evals/: schema-rejection and tool-retrieval regression coverage by rajeeja · Pull Request #62 · UXARRAY/uxarray-mcp-server

rajeeja · 2026-06-10T19:58:20Z

Summary

Two cheap, runnable evals under evals/ that turn behavior we care about into numbers we can re-measure on every PR.

evals/schema_rejection/ — 21 calls (19 deliberately malformed, 2 baselines). Classifies each outcome by layer (schema / IO / runtime / silent) and reports caught_rate. Currently 94.7% with 1 silent pass — plot_dataset(plot_type='variable') accepts a call with no variable_name and returns a plot anyway. That's a real bug surfaced by the eval; tracked separately, not fixed in this PR.
evals/tool_retrieval/ — BM25 over the full ~54-function tool surface against 30 labeled prompts. Reports top-1 / top-3 / top-5 selection accuracy and the mean rank of the correct tool. Currently 77% top-1, 87% top-3, 93% top-5.
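
For readers unfamiliar with BM25, the retrieval eval's ranking step can be sketched as below. This is a minimal standard-library illustration, not the code in evals/tool_retrieval/: the docstrings in `TOOLS` are invented for the example (only the tool names appear in this PR), and the `k1` / `b` values are common defaults assumed here rather than the runner's actual settings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return tool names sorted by descending BM25 score for the query."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: in how many docstrings each term appears.
    df = Counter(term for t in tokens.values() for term in set(t))

    scores = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[name] = score
    # Break ties by name so the ordering is deterministic.
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical docstring first lines, for illustration only.
TOOLS = {
    "plot_mesh": "Draw the grid wireframe for a mesh file.",
    "plot_variable": "Render a colored map of a data variable on the mesh.",
    "calculate_temporal_mean": "Compute the time average of a variable.",
}

ranking = bm25_rank("show me a wireframe of the grid", TOOLS)
print(ranking[0])
```

The rank of the labeled tool in such a list is what feeds the top-1 / top-3 / top-5 and mean-rank numbers.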

Both runners complete in under 30 seconds with no external dependencies. Eval result JSON files are gitignored; the runners themselves are the source of truth.

evals/README.md explains what an eval is for a non-AI engineer and lists when to add one vs. when to write a unit test.

Test plan

uv run pre-commit run --all-files — passes.
uv run pytest tests/ --ignore=tests/test_remote_agent.py — 295 passed.
uv run python -m evals.schema_rejection.run — completes; 1 known silent-pass bug reported.
uv run python -m evals.tool_retrieval.run — completes; 77 / 87 / 93 numbers reproduce.
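
The 94.7% caught_rate from the schema-rejection eval is consistent with 18 of the 19 malformed calls being rejected at some layer. The sketch below assumes that definition (the two baseline calls are excluded from the denominator); the layer names come from the summary, but the 12 / 3 / 3 split across layers is invented for illustration and is not a result from the runner.

```python
# Layers that count as "caught"; a "silent" outcome means the bad call succeeded.
CAUGHT_LAYERS = {"schema", "io", "runtime"}


def caught_rate(outcomes):
    """Fraction of malformed calls that were rejected at any layer."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    caught = sum(1 for layer in outcomes if layer in CAUGHT_LAYERS)
    return caught / len(outcomes)


# Assumption: 19 malformed calls, exactly 1 silent pass (from the PR summary).
# The per-layer counts below are hypothetical.
outcomes = ["schema"] * 12 + ["io"] * 3 + ["runtime"] * 3 + ["silent"]

rate = caught_rate(outcomes)
print(f"caught_rate = {rate:.1%}")  # caught_rate = 94.7%
```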

Two cheap, runnable evals that turn behavior we care about into numbers we can re-measure on every PR:

- evals/schema_rejection/: 21 calls (19 deliberately malformed, 2 baselines) classify each outcome by layer (schema / IO / runtime / silent). Headline number is caught_rate. Currently 94.7% with 1 silent pass (plot_dataset with plot_type='variable' but no variable_name still returns a plot).
- evals/tool_retrieval/: BM25 over the full ~54-function tool surface against 30 labeled prompts. Reports top-1 / top-3 / top-5 selection accuracy and mean rank of the correct tool. Currently 77% / 87% / 93%.

Both runners run in under 30 seconds with no external dependencies. Result JSON files are gitignored; the runners are the source of truth.

evals/README.md explains what an eval is for a non-AI engineer and lists when to add new ones vs. when to write a unit test instead.

…ually type

Targeted the 7 tools that ranked worst in the BM25 retrieval eval: rewrote each first line to include the words a user would naturally use ("wireframe", "colored map", "ensemble", "time average", "is the endpoint healthy", "start a new session", "list variables") rather than internal jargon.

evals/tool_retrieval results, same 30-prompt set:

- before: top-1 77%, top-3 87%, top-5 93%, mean rank 2.33, worst rank 19
- after: top-1 93%, top-3 100%, top-5 100%, mean rank 1.07, worst rank 2

The two remaining rank-2 cases are genuinely ambiguous (plot_mesh vs. plot_mesh_geo; inspect_variable vs. get_capabilities) and the right ones land in the top-3 shortlist, which is what discover_tools will return.

Tools touched: create_session, calculate_temporal_mean, calculate_ensemble_mean, diagnose_endpoint, inspect_variable, plot_mesh, plot_variable, plot_mesh_geo, get_capabilities. Behavior unchanged; only the leading docstring sentence moves.

Pre-commit (including mypy) and the full test suite (295 tests) pass.
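
The "after" numbers in this commit message follow from the rank distribution it describes: with 30 prompts and two rank-2 cases, the other 28 prompts must rank the correct tool first. A short sketch of the metric arithmetic, with hypothetical function names (the runner's own helpers may differ):

```python
def top_k_accuracy(ranks, k):
    """Share of prompts whose correct tool appears within the top k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def summarize(ranks):
    """Collect the headline retrieval metrics from 1-based ranks."""
    return {
        "top1": top_k_accuracy(ranks, 1),
        "top3": top_k_accuracy(ranks, 3),
        "top5": top_k_accuracy(ranks, 5),
        "mean_rank": sum(ranks) / len(ranks),
        "worst_rank": max(ranks),
    }


# 28 prompts at rank 1 and 2 prompts at rank 2, as implied by the commit message.
after_ranks = [1] * 28 + [2] * 2
summary = summarize(after_ranks)

print(f"top-1 {summary['top1']:.0%}, top-3 {summary['top3']:.0%}, "
      f"mean rank {summary['mean_rank']:.2f}, worst rank {summary['worst_rank']}")
# top-1 93%, top-3 100%, mean rank 1.07, worst rank 2
```

The "before" distribution cannot be reconstructed the same way, since only its aggregates (mean rank 2.33, worst rank 19) are reported.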

rajeeja added 2 commits June 10, 2026 14:58

rajeeja merged commit 1d9d947 into main Jun 10, 2026
7 checks passed

rajeeja deleted the rajeeja/evals branch June 10, 2026 21:47

rajeeja merged 2 commits into main from rajeeja/evals
